Description of the project

Given two data sets \(\texttt{sales.txt}\) and \(\texttt{item_size.txt}\), our goal is to produce a 52-week demand forecast. We first import, clean, and transform the data in a way that facilitates time-series analysis.

size = read.delim('item_size.txt', header = TRUE, sep = "|", dec = ".")
sales = read.delim('sales.txt',header = TRUE, sep = "|", dec =".")

#Make the transaction_date a date variable
sales$transaction_date = as.Date(sales$transaction_date)

We take an initial look at a sample of each data set, and at a histogram of transaction dates binned by week.

head(size)
##                                item_code item_size
## 1 1012-00001-0189-010-999400191165169227     M REG
## 2 1012-00001-0782-001-999500889587049998     L REG
## 3 1012-00001-0613-060-999700190542543223   XXL REG
## 4 1012-00001-5003-040-999600194112525862    XL REG
## 5               1012-00001-0616-930-9995     L REG
## 6 1012-00001-4616-535-999600191932476701    XL REG
head(sales)
##   store                              item_code transaction_date
## 1     1 1103-00001-0035-248-999300706420416901       2016-01-15
## 2     1                                    TAX       2016-01-04
## 3     1                                    TAX       2016-01-03
## 4     1                                    TAX       2016-01-05
## 5     1 1013-00001-0036-001-999500881862256954       2016-01-05
## 6     1 1012-00001-0063-002-999300881862256190       2016-01-30
##   transaction_quantity retail_amount
## 1                    1         49.99
## 2                    0          0.00
## 3                    0          0.00
## 4                    0          0.00
## 5                    1         39.99
## 6                    1         97.19
#Histogram of Sales since 2015
hist(sales$transaction_date, "weeks", freq=TRUE, main = "Histogram of Transaction Date", xlab="Transaction Date")

Cleaning the data

We notice that many of the lines in the sales data are TAX entries. These lines contribute no meaningful information, so we remove them.

#Remove the TAX entries
sales = sales[sales$item_code !="TAX",]

Examining the counts of each transaction quantity lets us look for patterns in the outliers.

as.data.frame(table(sales$transaction_quantity))
##    Var1   Freq
## 1   -10      1
## 2    -1  42048
## 3     0    118
## 4     1 751524
## 5     2    294
## 6     3     29
## 7     4      4
## 8     5      3
## 9     6      2
## 10    8      1
## 11   10      2
## 12   11      1
## 13   16      1
## 14   20      1
## 15   22      1
## 16   25      1
## 17   35      1
## 18   36      1
## 19   37      1

We see that, overwhelmingly, each transaction accounts for the purchase of a single product, and very few transactions involve more than 4 items. Let’s examine the dates of the largest purchases.

sales[sales$transaction_quantity>7,]
##         store                              item_code transaction_date
## 82181       1 6205-00001-0502-001-000000053329460611       2016-04-30
## 87314       1 6205-00001-0502-001-000000053329460611       2016-04-30
## 953267      1 1014-00001-0029-001-999500191165759831       2019-03-30
## 983380      1 1014-00001-0029-001-999400191165759756       2019-03-30
## 993292      1 1104-00001-0009-001-999400191932492169       2019-03-26
## 997037      1 1014-00001-0029-001-999700191165759985       2019-04-09
## 1032512     1 1014-00001-0029-001-999600191165759916       2019-04-09
## 1051676     1 1104-00001-0009-001-999500191932492244       2019-03-27
## 1069625     1 1104-00001-0009-001-999600191932491698       2019-03-29
## 1102735     1 1104-00001-0009-001-999300191932492060       2019-03-25
## 1140528     2 7010-00001-0051-376-000000192826309600       2019-08-14
##         transaction_quantity retail_amount
## 82181                     10          69.9
## 87314                     10       -1100.1
## 953267                    22         792.0
## 983380                    11         396.0
## 993292                    35        1295.0
## 997037                     8         288.0
## 1032512                   16         576.0
## 1051676                   36        1332.0
## 1069625                   37        1369.0
## 1102735                   25         925.0
## 1140528                   20        1602.0

We see that almost all of these major purchases happened between March 25th and April 9th, 2019. What was going on during those days? The North Face and Supreme collaboration. We also remove 3 other extreme purchases that appear erroneous.

#Remove Supreme/North Face Colab
sales  = sales[!(sales$transaction_quantity>2 & sales$transaction_date <= "2019-04-10" & sales$transaction_date >= "2019-03-20"),]

#Remove sales with quantity above 6 (only 3 remain, and they seem erroneous), and returns of more than 1 item (a single instance).
sales = sales[sales$transaction_quantity<=6,]
sales = sales[!(sales$transaction_quantity <=-2),]
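As a quick sanity check, the cleaning rules above can be exercised on a tiny hypothetical data frame (toy values, not the real data; the date-window filter for the collaboration is omitted here). Only quantities between -1 and 6 should survive:

```r
# Hypothetical mini sales table to exercise the cleaning filters
toy = data.frame(item_code = c("TAX", "A", "B", "C", "D"),
                 transaction_quantity = c(0, -10, -1, 3, 20))

toy = toy[toy$item_code != "TAX",]             # drop TAX rows
toy = toy[toy$transaction_quantity <= 6,]      # drop bulk purchases
toy = toy[!(toy$transaction_quantity <= -2),]  # drop multi-item returns

toy$transaction_quantity
## [1] -1  3
```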

Grouping the data into weeks

Because we are making a 52-week forecast, we need to collapse all the observations within each week into a single observation. We label each observation with a new column containing the previous (or current) Sunday, using the floor_date function from the lubridate package.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
dates <- sales$transaction_date
week_start_date = floor_date(sales$transaction_date , "weeks") 
sales = cbind(sales, week_start_date)
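A quick check that the week labels behave as intended: floor_date with unit "weeks" defaults to Sunday as the start of the week, so every date should map to the previous (or same) Sunday.

```r
library(lubridate)

# A Friday maps back to the previous Sunday...
floor_date(as.Date("2016-01-15"), "weeks")
## [1] "2016-01-10"

# ...and a Sunday maps to itself
floor_date(as.Date("2016-01-10"), "weeks")
## [1] "2016-01-10"
```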

Transforming The Data

We want to transform the data to be useful for time-series analysis. We want to have a column for the date, and a column for the total transaction quantity. We will do this for both stores, and total sales. This will result in 3 data sets that we write out to both .csv and .txt files. We begin with Store 1, which corresponds to the discount store. These 3 data sets do not include any size data.

#Store 1 Sales Data
store1_weekly_transactions = data.frame(matrix(0, ncol = 1, nrow = length(unique(week_start_date))))
names(store1_weekly_transactions) = c("transaction_quantity")
r = 1
for (j in 1:length(unique(week_start_date)))
{
  store1_weekly_transactions[r,1] = sum(sales[sales$week_start_date == unique(week_start_date)[j] & sales$store == 1,]$transaction_quantity)
  r = r+1
}

store = c(rep(1,length(unique(week_start_date))))
store1_weekly_transactions = cbind(store, unique(week_start_date), store1_weekly_transactions)
names(store1_weekly_transactions) = c("store","week_start_date", "transaction_quantity")
store1_weekly_transactions = store1_weekly_transactions[order(store1_weekly_transactions$week_start_date),]

write.csv(store1_weekly_transactions,"discount_without_size.csv",row.names = FALSE)
write.table(store1_weekly_transactions, "discount_without_size.txt", append = FALSE, sep = "\t", dec = ".",row.names = FALSE, col.names = TRUE)

Let’s view a sample of this new data set.

head(store1_weekly_transactions)
##   store week_start_date transaction_quantity
## 2     1      2016-01-03                 1779
## 1     1      2016-01-10                 1504
## 6     1      2016-01-17                 1598
## 3     1      2016-01-24                 1774
## 4     1      2016-01-31                 1540
## 8     1      2016-02-07                 2073
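As an aside, base R’s aggregate can compute the same weekly totals without an explicit loop. A sketch on a hypothetical mini data set (sales_demo stands in for the real sales):

```r
# Hypothetical stand-in for the real sales data
sales_demo = data.frame(
  store = c(1, 1, 1, 2),
  week_start_date = as.Date(c("2016-01-03", "2016-01-03", "2016-01-10", "2016-01-03")),
  transaction_quantity = c(1, 2, 1, 5))

# Weekly totals for store 1 in one call
store1 = sales_demo[sales_demo$store == 1,]
weekly = aggregate(transaction_quantity ~ week_start_date, data = store1, FUN = sum)
weekly$transaction_quantity
## [1] 3 1
```

One caveat: aggregate drops weeks with no store-1 transactions entirely, whereas the loop records them as 0, so the result would need to be padded against the full list of weeks before use as a time series.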

We do the same for store 2, which corresponds to the regular store.

#Store 2 sales data
store2_weekly_transactions = data.frame(matrix(0, ncol = 1, nrow = length(unique(week_start_date))))
names(store2_weekly_transactions) = c("transaction_quantity")
r = 1
for (j in 1:length(unique(week_start_date)))
{
  store2_weekly_transactions[r,1] = sum(sales[sales$week_start_date == unique(week_start_date)[j] & sales$store == 2,]$transaction_quantity)
  r = r+1
}

store = c(rep(2,length(unique(week_start_date))))
store2_weekly_transactions = cbind(store, unique(week_start_date),store2_weekly_transactions)
names(store2_weekly_transactions) = c("store", "week_start_date", "transaction_quantity")
store2_weekly_transactions = store2_weekly_transactions[order(store2_weekly_transactions$week_start_date),]

head(store2_weekly_transactions)
##   store week_start_date transaction_quantity
## 2     2      2016-01-03                    0
## 1     2      2016-01-10                    0
## 6     2      2016-01-17                    0
## 3     2      2016-01-24                    0
## 4     2      2016-01-31                    0
## 8     2      2016-02-07                    0
write.csv(store2_weekly_transactions,"regular_without_size.csv",row.names = FALSE)
write.table(store2_weekly_transactions, "regular_without_size.txt", append = FALSE, sep = "\t", dec = ".",row.names = FALSE, col.names = TRUE)

We lastly do a combined analysis where we don’t separate by store, and get total sales across both stores.

#Total sales data
total_weekly_transactions = data.frame(matrix(0, ncol = 1, nrow = length(unique(week_start_date))))
names(total_weekly_transactions) = c("transaction_quantity")
r = 1
for (j in 1:length(unique(week_start_date)))
{
  total_weekly_transactions[r,1] = sum(sales[sales$week_start_date == unique(week_start_date)[j],]$transaction_quantity)
  r = r+1
}

total_weekly_transactions = cbind(unique(week_start_date), total_weekly_transactions)
names(total_weekly_transactions) = c("week_start_date","transaction_quantity")
total_weekly_transactions = total_weekly_transactions[order(total_weekly_transactions$week_start_date),]

head(total_weekly_transactions)
##   week_start_date transaction_quantity
## 2      2016-01-03                 1779
## 1      2016-01-10                 1504
## 6      2016-01-17                 1598
## 3      2016-01-24                 1774
## 4      2016-01-31                 1540
## 8      2016-02-07                 2073
write.csv(total_weekly_transactions,"total_without_size.csv",row.names = FALSE)
write.table(total_weekly_transactions, "total_without_size.txt", append = FALSE, sep = "\t", dec = ".",row.names = FALSE, col.names = TRUE)

We will now create 3 more data sets with a separate column for each size. That is, for each week (a row), we get the total transactions for each size. At this point it is important to make an observation. The number of unique item_codes in the sales data is:

length(unique(sales$item_code))
## [1] 66984

While the number of unique item_codes in the size data is:

length(unique(size$item_code))
## [1] 16826

We see that there are many more unique items in the sales data than in the size data. This could mean one of two things: 1) the sales data is not just men’s coats, but every item sold at the North Face store (e.g. shirts, backpacks, hats), including items that simply don’t come in a particular size; or 2) the sales data contains coats that are not in the size data. Given that the description of the problem indicated that the item_code in the sales data was the “Unique identifier for men’s insulated jackets”, I will go with option 2.

Since many of the coats in the sales data do not have recorded sizes, when we merge \(\texttt{size}\) and \(\texttt{sales}\) we only keep the records that have sizes, thus shrinking our dataset considerably.
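The effect of the merge can be made explicit before committing to it. A sketch on hypothetical codes (sales_demo and size_demo stand in for the real tables): merge defaults to an inner join, while all.x = TRUE keeps unmatched rows with item_size = NA so the loss is countable.

```r
sales_demo = data.frame(item_code = c("A", "B", "C"), transaction_quantity = c(1, 1, 1))
size_demo  = data.frame(item_code = c("A", "C"), item_size = c("M REG", "L REG"))

# Default (inner) merge silently drops code "B"
nrow(merge(sales_demo, size_demo, by = "item_code"))
## [1] 2

# An outer merge keeps it, so the dropped rows can be counted
merged = merge(sales_demo, size_demo, by = "item_code", all.x = TRUE)
sum(is.na(merged$item_size))
## [1] 1
```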

For this transformation we run a nested for loop to get all the transactions for each week and for each size. We do this for store 1, store 2, and total sales, producing 3 more data sets.

#Get sales data with size column
sales.plus.size = merge(x=sales, y=size, by="item_code")

#Store 1 + sizes
store1_weekly_transactions.plus.size = data.frame(matrix(0, ncol = length(unique(sales.plus.size$item_size)), nrow = length(unique(week_start_date))))

r = 1
for (j in 1:length(unique(week_start_date)))
{
  c=1
  for (k in 1:length(unique(sales.plus.size$item_size)))
  {
    store1_weekly_transactions.plus.size[r,c] = sum(sales.plus.size[sales.plus.size$week_start_date == unique(week_start_date)[j] & sales.plus.size$item_size ==unique(sales.plus.size$item_size)[k] & sales.plus.size$store ==1 ,]$transaction_quantity)
    c = c+1
  }
  r=r+1
}

store1_weekly_transactions.plus.size = cbind(unique(week_start_date), store1_weekly_transactions.plus.size)
names(store1_weekly_transactions.plus.size) = c("week_start_date",sapply(unique(sales.plus.size$item_size),toString))
store1_weekly_transactions.plus.size = store1_weekly_transactions.plus.size[order(store1_weekly_transactions.plus.size$week_start_date),]

head(store1_weekly_transactions.plus.size)
##   week_start_date L REG S REG M REG XL REG XXL REG XS REG XXS REG 3XL REG
## 2      2016-01-03    60    29    52     29       8      6       0       0
## 1      2016-01-10    28    23    26     17      10      8       0       0
## 6      2016-01-17    50    20    47     26      10      3       0       0
## 3      2016-01-24    46    13    32     19       9      5       0       0
## 4      2016-01-31    25     8    27     10       6      9       0       0
## 8      2016-02-07    48    21    48     36      16      3       0       0
write.csv(store1_weekly_transactions.plus.size,"discount_with_size.csv",row.names = FALSE)
write.table(store1_weekly_transactions.plus.size, "discount_with_size.txt", append = FALSE, sep = "\t", dec = ".",row.names = FALSE, col.names = TRUE)

Similarly for store 2.

#Store 2 + sizes
store2_weekly_transactions.plus.size = data.frame(matrix(0, ncol = length(unique(sales.plus.size$item_size)), nrow = length(unique(week_start_date))))

r = 1
for (j in 1:length(unique(week_start_date)))
{
  c=1
  for (k in 1:length(unique(sales.plus.size$item_size)))
  {
    store2_weekly_transactions.plus.size[r,c] = sum(sales.plus.size[sales.plus.size$week_start_date == unique(week_start_date)[j] & sales.plus.size$item_size ==unique(sales.plus.size$item_size)[k] & sales.plus.size$store ==2,]$transaction_quantity)
    c = c+1
  }
  r=r+1
}

store2_weekly_transactions.plus.size = cbind(unique(week_start_date), store2_weekly_transactions.plus.size)
names(store2_weekly_transactions.plus.size) = c("week_start_date",sapply(unique(sales.plus.size$item_size),toString))
store2_weekly_transactions.plus.size = store2_weekly_transactions.plus.size[order(store2_weekly_transactions.plus.size$week_start_date),]

head(store2_weekly_transactions.plus.size)
##   week_start_date L REG S REG M REG XL REG XXL REG XS REG XXS REG 3XL REG
## 2      2016-01-03     0     0     0      0       0      0       0       0
## 1      2016-01-10     0     0     0      0       0      0       0       0
## 6      2016-01-17     0     0     0      0       0      0       0       0
## 3      2016-01-24     0     0     0      0       0      0       0       0
## 4      2016-01-31     0     0     0      0       0      0       0       0
## 8      2016-02-07     0     0     0      0       0      0       0       0
write.csv(store2_weekly_transactions.plus.size,"regular_with_size.csv",row.names = FALSE)
write.table(store2_weekly_transactions.plus.size, "regular_with_size.txt", append = FALSE, sep = "\t", dec = ".",row.names = FALSE, col.names = TRUE)

Lastly, we do the same thing, but for total sales across both stores.

#Total + size
total_weekly_transactions.plus.size = data.frame(matrix(0, ncol = length(unique(sales.plus.size$item_size)), nrow = length(unique(week_start_date))))

r = 1
for (j in 1:length(unique(week_start_date)))
{
  c=1
  for (k in 1:length(unique(sales.plus.size$item_size)))
  {
    total_weekly_transactions.plus.size[r,c] = sum(sales.plus.size[sales.plus.size$week_start_date == unique(week_start_date)[j] & sales.plus.size$item_size ==unique(sales.plus.size$item_size)[k],]$transaction_quantity)
    c = c+1
  }
  r=r+1
}

total_weekly_transactions.plus.size = cbind(unique(week_start_date), total_weekly_transactions.plus.size)
names(total_weekly_transactions.plus.size) = c("week_start_date",sapply(unique(sales.plus.size$item_size),toString))
total_weekly_transactions.plus.size = total_weekly_transactions.plus.size[order(total_weekly_transactions.plus.size$week_start_date),]

head(total_weekly_transactions.plus.size)
##   week_start_date L REG S REG M REG XL REG XXL REG XS REG XXS REG 3XL REG
## 2      2016-01-03    60    29    52     29       8      6       0       0
## 1      2016-01-10    28    23    26     17      10      8       0       0
## 6      2016-01-17    50    20    47     26      10      3       0       0
## 3      2016-01-24    46    13    32     19       9      5       0       0
## 4      2016-01-31    25     8    27     10       6      9       0       0
## 8      2016-02-07    48    21    48     36      16      3       0       0
write.csv(total_weekly_transactions.plus.size,"total_with_size.csv",row.names = FALSE)
write.table(total_weekly_transactions.plus.size, "total_with_size.txt", append = FALSE, sep = "\t", dec = ".",row.names = FALSE, col.names = TRUE)

Creating a Forecast

I will create the following model: a 52-week projection for total sales, together with the distribution of sizes for each week. Let’s first demonstrate a forecast for just the discount store with no consideration of sizing. We will use a seasonal ARIMA model.

#Load in the data and make it timeseries data. 
discount_without_size = read.delim('discount_without_size.txt', header = TRUE, sep = "\t", dec = ".")
ts_dis = ts(discount_without_size$transaction_quantity, start=c(2016,1), freq=52)

#remove the last entry since the week is incomplete. 
ts_dis.red = ts_dis[1:202]

library(astsa)
## 
## Attaching package: 'astsa'
## The following object is masked _by_ '.GlobalEnv':
## 
##     sales
library(forecast)
## 
## Attaching package: 'forecast'
## The following object is masked from 'package:astsa':
## 
##     gas

We first need to determine what type of differences to take. It is standard to take one seasonal difference, and, if a positive or negative trend remains, an additional difference between neighboring points, in order to produce a stationary series. Below we take both types of differences and end up with what appears to be a stationary series.
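In backshift-operator notation (with \(B\,x_t = x_{t-1}\)), the two differences we take below combine as

\[
y_t = (1-B)(1-B^{52})x_t = (x_t - x_{t-1}) - (x_{t-52} - x_{t-53}),
\]

which is exactly what diff(diff(x, 52)) computes in R.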

plot(ts_dis.red, type = "l",main="Transactions From Discount Store", xlab="Weeks Since 2016-01-03",ylab="Transactions")

#take seasonal differences
plot(diff(ts_dis.red,52),type="l",main="Diff(TS,52)", xlab="Weeks Since 2016-01-03",ylab="Transactions")

#Still a trend, so take neighboring differences
plot(diff(diff(ts_dis.red,52)),type="l",main="Diff(Diff(TS,52),1)", xlab="Weeks Since 2016-01-03",ylab="Transactions")

Next, we analyze the autocorrelations and partial autocorrelations to determine the orders of the AR, MA, seasonal AR, and seasonal MA terms. A rule of thumb: early spikes in the ACF suggest the order of the MA part, and early spikes in the PACF suggest the order of the AR part. Spikes at lags corresponding to the length of the season (52 in our case) similarly give the orders of the seasonal AR and MA parts.

b = acf2(diff(diff(ts_dis.red,52)), 100)

In our situation we see two early spikes in the ACF and 2 early spikes in the PACF, so our non-seasonal parameters can be (2,1,2). We see one spike around lag 52 in both, so our seasonal parameters can be (1,1,1). Fitting this model yields a 52-week forecast and a plot with 80% and 95% CIs.

model = Arima(ts_dis.red, order=c(2,1,2), seasonal=list(order=c(1,1,1), period=52), method="CSS") 
summary(model)
## Series: ts_dis.red 
## ARIMA(2,1,2)(1,1,1)[52] 
## 
## Coefficients:
##           ar1     ar2     ma1      ma2     sar1     sma1
##       -0.6055  0.2713  0.1270  -0.9941  -0.0153  -0.6906
## s.e.   0.0274  0.0092  0.0299   0.0338   0.0759   0.1459
## 
## sigma^2 estimated as 177984:  part log likelihood=-1142.55
## 
## Training set error measures:
##                     ME    RMSE      MAE       MPE     MAPE      MASE
## Training set -27.36349 354.963 182.8613 -2.127001 5.776429 0.1916225
##                      ACF1
## Training set 0.0008177134
forecast(model,h=52)
##     Point Forecast     Lo 80     Hi 80      Lo 95     Hi 95
## 203      11455.273 10862.335 12048.211 10548.4524 12362.093
## 204       4021.704  3383.073  4660.335  3045.0029  4998.405
## 205       5088.672  4443.538  5733.807  4102.0242  6075.321
## 206       6618.490  5968.022  7268.957  5623.6854  7613.294
## 207       9527.449  8875.185 10179.712  8529.8978 10525.000
## 208       6828.120  6173.166  7483.075  5826.4535  7829.787
## 209       3472.566  2823.187  4121.945  2479.4265  4465.706
## 210       2039.902  1387.461  2692.343  1042.0796  3037.725
## 211       1906.504  1252.370  2560.638   906.0920  2906.915
## 212       2040.750  1383.729  2697.770  1035.9238  3045.576
## 213       1872.799  1214.000  2531.599   865.2527  2880.346
## 214       1987.719  1326.168  2649.270   975.9646  2999.474
## 215       2726.064  2062.663  3389.464  1711.4799  3740.648
## 216       3039.414  2373.372  3705.456  2020.7904  4058.037
## 217       2520.037  1852.089  3187.986  1498.4979  3541.577
## 218       2600.226  1929.726  3270.725  1574.7856  3625.666
## 219       2867.705  2195.254  3540.155  1839.2802  3896.129
## 220       2671.108  1996.182  3346.033  1638.8984  3703.317
## 221       2422.761  1745.850  3099.672  1387.5149  3458.008
## 222       2090.052  1410.730  2769.375  1051.1179  3128.986
## 223       1950.339  1269.005  2631.674   908.3286  2992.350
## 224       2332.948  1649.256  3016.640  1287.3315  3378.565
## 225       2453.653  1767.930  3139.375  1404.9304  3502.375
## 226       2362.214  1674.179  3050.248  1309.9554  3414.472
## 227       2425.184  1735.105  3115.262  1369.7997  3480.568
## 228       2924.780  2232.429  3617.131  1865.9203  3983.640
## 229       3934.165  3239.761  4628.568  2872.1664  4996.163
## 230       3892.398  3195.756  4589.040  2826.9762  4957.820
## 231       2404.267  1705.568  3102.966  1335.6996  3472.835
## 232       2898.028  2197.121  3598.936  1826.0825  3969.974
## 233       3069.887  2366.920  3772.853  1994.7926  4144.980
## 234       3971.444  3266.295  4676.592  2893.0119  5049.875
## 235       4583.853  3876.647  5291.059  3502.2753  5665.431
## 236       2721.750  2012.385  3431.115  1636.8703  3806.631
## 237       2822.513  2111.094  3533.932  1734.4915  3910.535
## 238       3657.248  2943.691  4370.805  2565.9565  4748.540
## 239       4454.504  3738.897  5170.110  3360.0782  5548.929
## 240       4582.695  3864.969  5300.420  3485.0280  5680.361
## 241       4170.466  3450.698  4890.235  3069.6757  5271.257
## 242       3926.459  3204.588  4648.329  2822.4532  5030.465
## 243       4607.876  3883.970  5331.782  3500.7579  5714.994
## 244       4540.939  3814.947  5266.931  3430.6295  5651.248
## 245       2459.960  1731.941  3187.979  1346.5513  3573.369
## 246       3293.390  2563.299  4023.481  2176.8122  4409.968
## 247       4786.692  4054.584  5518.800  3667.0293  5906.355
## 248       4924.382  4190.215  5658.549  3801.5698  6047.194
## 249       4839.099  4102.924  5575.273  3713.2173  5964.980
## 250       4126.613  3388.392  4864.835  2997.6012  5255.626
## 251       3775.008  3034.790  4515.225  2642.9431  4907.072
## 252       3570.063  2827.810  4312.317  2434.8846  4705.242
## 253       3966.010  3221.773  4710.248  2827.7971  5104.223
## 254       2930.625  2184.361  3676.889  1789.3124  4071.938
plot(forecast(model,h=52), main = "Forecast for Discount Store")

Due to the size of the output from the forecast function, I will hide its output from here on. Another way to pick the parameters is the auto.arima function, which searches over candidate models and picks the one with the best information criterion (AICc by default). This can sometimes produce results that seem suspicious, e.g. no seasonal part.
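When a suspicious model appears, auto.arima's stepwise shortcut and likelihood approximation can be turned off, at the cost of a much longer run. A sketch on a short hypothetical series (with the real data, y would be ts_dis); the arguments shown are standard forecast-package options:

```r
library(forecast)

# Hypothetical stand-in series (short and quarterly, purely for speed)
set.seed(1)
y = ts(cumsum(rnorm(80)), frequency = 4)

# Exhaustive (non-stepwise) search with explicit BIC-based selection
fit = auto.arima(y, ic = "bic", stepwise = FALSE, approximation = FALSE)
```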

#this model agrees with a grid search over BIC
auto.arima(ts_dis)
## Series: ts_dis 
## ARIMA(0,1,2)(1,1,0)[52] 
## 
## Coefficients:
##           ma1      ma2     sar1
##       -0.5443  -0.3921  -0.2776
## s.e.   0.0827   0.0819   0.1011
## 
## sigma^2 estimated as 397528:  log likelihood=-1173.52
## AIC=2355.04   AICc=2355.32   BIC=2367.06

We see the auto.arima function gives the parameters (0,1,2)(1,1,0). We can fit this model as well and plot its forecast.

#model choice given by auto.arima. 
model2 = Arima(ts_dis.red, order=c(0,1,2), seasonal=list(order=c(1,1,0), period=52), method="CSS") 

plot(forecast(model2,h=52), main = "Forecast for Discount Store - Auto.Model")

Now that we’ve seen the process for how to build a SARIMA model we can begin to build our final demand forecast.

Total Sales Forecast

We proceed just as before.

#Load in the data and make it timeseries data. 
total_without_size = read.delim('total_without_size.txt', header = TRUE, sep = "\t", dec = ".")
ts_tot = ts(total_without_size$transaction_quantity, start=c(2016,1), freq=52)

#remove the last entry since the week is incomplete. 
ts_tot.red = ts_tot[1:202]

plot(ts_tot.red, type = "l",main="Total Transactions", xlab="Weeks Since 2016-01-03",ylab="Transactions")

#take seasonal differences
plot(diff(ts_tot.red,52),type="l",main="Diff(TS,52)", xlab="Weeks Since 2016-01-03",ylab="Transactions")

#Still a trend, so take neighboring differences
plot(diff(diff(ts_tot.red,52)),type="l",main="Diff(Diff(TS,52),1)", xlab="Weeks Since 2016-01-03",ylab="Transactions")

b = acf2(diff(diff(ts_tot.red,52)), 100)

Based on the ACF and PACF plots, the model parameters should be (2,1,2)(1,1,1).

model = Arima(ts_tot.red, order=c(2,1,2), seasonal=list(order=c(1,1,1), period=52), method="CSS") 
summary(model)
## Series: ts_tot.red 
## ARIMA(2,1,2)(1,1,1)[52] 
## 
## Coefficients:
##           ar1     ar2     ma1      ma2     sar1     sma1
##       -0.6055  0.2713  0.1270  -0.9941  -0.0153  -0.6906
## s.e.   0.0274  0.0092  0.0299   0.0338   0.0759   0.1459
## 
## sigma^2 estimated as 177984:  part log likelihood=-1142.55
## 
## Training set error measures:
##                     ME    RMSE      MAE       MPE     MAPE      MASE
## Training set -27.36349 354.963 182.8613 -2.127001 5.776429 0.1916225
##                      ACF1
## Training set 0.0008177134
plot(forecast(model,h=52), main = "Forecast for Total Sales - My Model")

Using the auto.arima function gives:

#this model agrees with a grid search over BIC
auto.arima(ts_tot)
## Series: ts_tot 
## ARIMA(0,1,2)(1,1,0)[52] 
## 
## Coefficients:
##           ma1      ma2     sar1
##       -0.5443  -0.3921  -0.2776
## s.e.   0.0827   0.0819   0.1011
## 
## sigma^2 estimated as 397528:  log likelihood=-1173.52
## AIC=2355.04   AICc=2355.32   BIC=2367.06

which gives the parameters (0,1,2)(1,1,0).

#model choice given by auto.arima. 
model2 = Arima(ts_tot.red, order=c(0,1,2), seasonal=list(order=c(1,1,0), period=52), method="CSS") 

plot(forecast(model2,h=52), main = "Forecast for Total Sales - Auto.Model")

Testing on our training set

To decide which model to use, we compare against previous data. We look at how well each model predicts the last 52 weeks of the training set. The line in red is our known data, and blue represents the prediction.
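The comparison we use is the sum of absolute errors over the held-out weeks; wrapping it in a small hypothetical helper makes the intent explicit:

```r
# Sum of absolute errors between point forecasts and held-out actuals
sae = function(forecast_vals, actual) sum(abs(forecast_vals - actual))

sae(c(100, 110), c(105, 100))
## [1] 15
```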

#Load in the data and make it timeseries data. 
total_without_size = read.delim('total_without_size.txt', header = TRUE, sep = "\t", dec = ".")
ts_tot = ts(total_without_size$transaction_quantity, start=c(2016,1), freq=52)

#Hold out the last 52 weeks (151-202) as a test set. 
ts_tot.test = ts_tot[1:150]

model = Arima(ts_tot.test, order=c(2,1,2), seasonal=list(order=c(0,1,1), period=52), method="CSS") 
summary(model)
## Series: ts_tot.test 
## ARIMA(2,1,2)(0,1,1)[52] 
## 
## Coefficients:
##           ar1     ar2     ma1      ma2     sma1
##       -0.5541  0.1413  0.0034  -0.9509  -0.3154
## s.e.   0.1008  0.1037  0.0352   0.0351   0.1669
## 
## sigma^2 estimated as 389319:  part log likelihood=-760.38
## 
## Training set error measures:
##                    ME     RMSE     MAE       MPE     MAPE      MASE
## Training set 46.70415 488.6538 277.491 0.6620145 8.411441 0.3133448
##                     ACF1
## Training set -0.01298397
plot(forecast(model,h=52), main = "Forecast for Total Sales TEST - My Model")
lines(ts_tot[1:201], type="l", col="red")

#model choice given by auto.arima. 
model2 = Arima(ts_tot.test, order=c(0,1,2), seasonal=list(order=c(1,1,0), period=52), method="CSS") 

plot(forecast(model2,h=52), main = "Forecast for Total Sales TEST - Auto.Model")
lines(ts_tot[1:201], type="l", col="red")

myforecast = data.frame(forecast(model,52))
autoforecast = data.frame(forecast(model2,52))

sum(abs(myforecast$Point.Forecast-ts_tot[151:202]))
## [1] 36953.72
sum(abs(autoforecast$Point.Forecast-ts_tot[151:202]))
## [1] 30710.75

Since the model given by auto.arima has the smaller error, we choose it for our 52-week demand forecast.

demand_forecast = data.frame(forecast(model2,52))

We now incorporate the size data into our model. The plan is the following: create a 52-week demand forecast for each of the 8 sizes. Each size then represents a proportion of the total predicted sized transactions. We multiply this proportion by our total-sales prediction to finally get a predicted quantity for each size, for each week.
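The allocation step can be sketched on hypothetical numbers: the per-size forecasts for a week are normalized into proportions, then scaled by that week’s total-sales forecast.

```r
# Hypothetical per-size point forecasts for a single week
size_forecast  = c(L = 60, M = 50, S = 25, XL = 15)
total_forecast = 300   # total-sales forecast for the same week

# Normalize to proportions, then scale by the total forecast
proportions = size_forecast / sum(size_forecast)
round(proportions * total_forecast)
```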

ALL the code for each of these models is built using the exact same method shown above.

Large

total_with_size = read.delim('total_with_size.txt', header = TRUE, sep = "\t", dec = ".")
ts_large = ts(total_with_size$L.REG, start=c(2016,1), freq=52)
ts_large.red = ts_large[1:202]
b = acf2(diff(diff(ts_large.red,52)), 100)

model = Arima(ts_large.red, order=c(2,1,2), seasonal=list(order=c(1,1,1), period=52), method="CSS") 
summary(model)
## Series: ts_large.red 
## ARIMA(2,1,2)(1,1,1)[52] 
## 
## Coefficients:
##          ar1     ar2      ma1     ma2     sar1     sma1
##       0.2627  0.1851  -0.5755  -0.557  -0.0218  -0.6540
## s.e.  0.0333  0.0882      NaN     NaN   0.0096   0.1719
## 
## sigma^2 estimated as 291.7:  part log likelihood=-664.73
## 
## Training set error measures:
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.1470356 14.36995 7.360439 -10.94277 22.06488 0.3711611
##                       ACF1
## Training set -0.0007727724
plot(forecast(model,h=52), main = "Forecast for Large Coats - My Model")

auto.arima(ts_large)
## Series: ts_large 
## ARIMA(0,1,0)(1,1,0)[52] 
## 
## Coefficients:
##          sar1
##       -0.1751
## s.e.   0.1139
## 
## sigma^2 estimated as 652.1:  log likelihood=-694.51
## AIC=1393.01   AICc=1393.09   BIC=1399.02
model2 = Arima(ts_large.red, order=c(2,1,1), seasonal=list(order=c(0,1,1), period=52), method="CSS") 
plot(forecast(model2,h=52), main = "Forecast for Large Coats - Auto.Model")

#We choose my model.
forecast_large = data.frame(forecast(model,h=52))
demand_forecast = cbind(demand_forecast,forecast_large$Point.Forecast)

Medium

ts_med = ts(total_with_size$M.REG, start=c(2016,1), freq=52)
ts_med.red = ts_med[1:202]
b = acf2(diff(diff(ts_med.red,52)), 100)

model = Arima(ts_med.red, order=c(3,1,2), seasonal=list(order=c(1,1,1), period=52), method="CSS") 
summary(model)
## Series: ts_med.red 
## ARIMA(3,1,2)(1,1,1)[52] 
## 
## Coefficients:
##           ar1      ar2      ar3     ma1     ma2     sar1    sma1
##       -0.8698  -0.9328  -0.4695  0.8737  0.6678  -0.5969  0.4423
## s.e.   0.1196   0.0864   0.0883  0.1171  0.0922   0.0474  0.2175
## 
## sigma^2 estimated as 236.7:  part log likelihood=-649.43
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE     MASE
## Training set 0.1764305 12.89933 6.185121 -7.227211 26.07649 0.312757
##                     ACF1
## Training set -0.01024261
plot(forecast(model,h=52), main = "Forecast for Medium Coats - My Model")

auto.arima(ts_med)
## Series: ts_med 
## ARIMA(2,1,4)(0,1,0)[52] 
## 
## Coefficients:
##           ar1     ar2      ma1      ma2     ma3     ma4
##       -0.1889  0.6507  -0.3337  -0.9703  0.2565  0.0976
## s.e.   0.4312  0.3379   0.4411   0.4939  0.1991  0.1252
## 
## sigma^2 estimated as 862.9:  log likelihood=-712.67
## AIC=1439.34   AICc=1440.14   BIC=1460.37
model2 = Arima(ts_med.red, order=c(2,1,2), seasonal=list(order=c(1,1,0), period=52), method="CSS") 
plot(forecast(model2,h=52), main = "Forecast for Medium Coats - Auto.Model")

#We choose the auto-based model.
forecast_medium = data.frame(forecast(model2,h=52))
demand_forecast = cbind(demand_forecast, forecast_medium$Point.Forecast)

Small

ts_small = ts(total_with_size$S.REG, start=c(2016,1), freq=52)
ts_small.red = ts_small[1:202]
b = acf2(diff(diff(ts_small.red,52)), 100)

model = Arima(ts_small.red, order=c(3,1,1), seasonal=list(order=c(0,1,1), period=52), method="CSS") 
summary(model)
## Series: ts_small.red 
## ARIMA(3,1,1)(0,1,1)[52] 
## 
## Coefficients:
##          ar1     ar2     ar3      ma1     sma1
##       0.3931  0.0432  0.1443  -0.9762  -0.1668
## s.e.  0.0909  0.0932  0.0907   0.0225   0.1051
## 
## sigma^2 estimated as 225.8:  part log likelihood=-614.15
## 
## Training set error measures:
##                    ME     RMSE     MAE MPE MAPE      MASE        ACF1
## Training set 1.100382 12.68656 6.43006 Inf  Inf 0.5703628 -0.01032663
plot(forecast(model,h=52), main = "Forecast for Small Coats - My Model")

auto.arima(ts_small)
## Series: ts_small 
## ARIMA(1,0,3) with non-zero mean 
## 
## Coefficients:
##          ar1      ma1      ma2     ma3     mean
##       0.8247  -0.4586  -0.0703  0.1673  24.0780
## s.e.  0.0645   0.0878   0.0713  0.0815   4.8919
## 
## sigma^2 estimated as 390.5:  log likelihood=-887.12
## AIC=1786.25   AICc=1786.68   BIC=1806.1
model2 = Arima(ts_small.red, order=c(1,0,1), seasonal=list(order=c(0,0,1), period=52), method="CSS") 
plot(forecast(model2,h=52), main = "Forecast for Small Coats - Auto.Model")

forecast_small = data.frame(forecast(model,h=52))
demand_forecast = cbind(demand_forecast, forecast_small$Point.Forecast)

XL

ts_XL = ts(total_with_size$XL.REG, start=c(2016,1), freq=52)
ts_XL.red = ts_XL[1:202]
b = acf2(diff(diff(ts_XL.red,52)), 100)

model = Arima(ts_XL.red, order=c(2,1,1), seasonal=list(order=c(0,1,0), period=52), method="CSS")
summary(model)
plot(forecast(model,h=52), main = "Forecast for XL Coats - My Model")

auto.arima(ts_XL)
model2 = Arima(ts_XL.red, order=c(1,0,1), seasonal=list(order=c(0,1,1), period=52), method="CSS")
plot(forecast(model2,h=52), main = "Forecast for XL Coats - Auto.Model")

forecast_XL = data.frame(forecast(model,h=52))
demand_forecast = cbind(demand_forecast, forecast_XL$Point.Forecast)

XXL

ts_XXL = ts(total_with_size$XXL.REG, start=c(2016,1), freq=52)
ts_XXL.red = ts_XXL[1:202]
b = acf2(diff(diff(ts_XXL.red,52)), 100)

model = Arima(ts_XXL.red, order=c(2,1,1), seasonal=list(order=c(0,1,0), period=52), method="CSS")
summary(model)
## Series: ts_XXL.red 
## ARIMA(2,1,1)(0,1,0)[52] 
## 
## Coefficients:
##          ar1      ar2      ma1
##       0.0935  -0.0370  -0.8163
## s.e.  0.1222   0.1093   0.0904
## 
## sigma^2 estimated as 36.17:  part log likelihood=-478.23
## 
## Training set error measures:
##                     ME     RMSE      MAE  MPE MAPE      MASE         ACF1
## Training set 0.2636462 5.112776 3.253087 -Inf  Inf 0.7084188 -0.002819882
plot(forecast(model,h=52), main = "Forecast for XXL Coats - My Model")

auto.arima(ts_XXL)
## Series: ts_XXL 
## ARIMA(1,0,1)(0,0,1)[52] with non-zero mean 
## 
## Coefficients:
##          ar1      ma1    sma1    mean
##       0.8531  -0.5162  0.4703  9.0614
## s.e.  0.0594   0.0966  0.1007  1.6863
## 
## sigma^2 estimated as 29.95:  log likelihood=-634.67
## AIC=1279.35   AICc=1279.65   BIC=1295.89
model2 = Arima(ts_XXL.red, order=c(1,0,1), seasonal=list(order=c(0,1,1), period=52), method="CSS")
plot(forecast(model2,h=52), main = "Forecast for XXL Coats - Auto.Model")

forecast_XXL = data.frame(forecast(model,h=52))
demand_forecast  = cbind(demand_forecast, forecast_XXL$Point.Forecast)

The last three sizes, 3XL, XS, and XXS, are purchased so infrequently that a time-series model, and especially a seasonal ARIMA model, is not the best choice. So few of these coats are sold, and at such irregular times, that it makes more sense to restock them periodically as needed.

3XL

ts_3XL = ts(total_with_size$X3XL.REG, start=c(2016,1), freq=52)
ts_3XL.red = ts_3XL[1:202]
plot(ts_3XL.red, type = "l", main = "3XL Transactions", xlab="Weeks Since 2016-01-03", ylab="Transactions")

sum(total_with_size$X3XL.REG)/202
## [1] 0.4059406
forecast_3XL = c(rep(.4,52))
forecast_3XL = data.frame(forecast_3XL)
demand_forecast = cbind(demand_forecast,forecast_3XL)

XS

ts_XS = ts(total_with_size$XS.REG, start=c(2016,1), freq=52)
ts_XS.red = ts_XS[1:202]
plot(ts_XS.red, type = "l", main = "XS Transactions", xlab="Weeks Since 2016-01-03", ylab="Transactions")

sum(total_with_size$XS.REG)/202
## [1] 0.8267327
forecast_XS = c(rep(.827,52))
forecast_XS = data.frame(forecast_XS)
demand_forecast = cbind(demand_forecast, forecast_XS)

XXS

ts_XXS = ts(total_with_size$XXS.REG, start=c(2016,1), freq=52)
ts_XXS.red = ts_XXS[1:202]
plot(ts_XXS.red, type = "l", main = "XXS Transactions", xlab="Weeks Since 2016-01-03", ylab="Transactions")

sum(ts_XXS.red)/202
## [1] 0.02475248
forecast_XXS = c(rep(0.024,52))
forecast_XXS = data.frame(forecast_XXS)
demand_forecast = cbind(demand_forecast, forecast_XXS)
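The same flat-rate construction is repeated for each of the three rare sizes; a small helper function (hypothetical, not part of the original script) would make the pattern explicit:

#Hypothetical helper: forecast a rare size as its observed average weekly demand
flat_forecast = function(x, h = 52) data.frame(rep(mean(x), h))
forecast_3XL = flat_forecast(ts_3XL.red)
forecast_XS = flat_forecast(ts_XS.red)
forecast_XXS = flat_forecast(ts_XXS.red)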

We now use these forecasts to compute the proportion of total sales allocated to each size. To do this we run a nested for loop; the heart of the matter is the expression inside the loops. We make sure the columns are named appropriately and round each prediction to the nearest integer. We write the prediction to a CSV file and a TXT file; it is also printed here.

names(demand_forecast) = c("Total_Sales","Low 80","High 80","Low 95","High 95", "L","M","S","XL","XXL","3XL", "XS", "XXS")

#We now have a demand forecast, but we need a proportional estimate for sizes.
size_proportions = data.frame(matrix(0, ncol = 8, nrow = 52))
for (i in 1:52)
{
  for (j in 1:8)
  {
    size_proportions[i,j] = (demand_forecast[i,5+j]/sum(demand_forecast[i,6:13]))*demand_forecast[i,1]
  }
}
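The nested loop can equivalently be vectorized; R recycles the length-52 row-sum vector down each column, so the row-wise allocation can be written as (a sketch, assuming the column layout above):

size_cols = demand_forecast[, 6:13]
size_proportions = size_cols / rowSums(size_cols) * demand_forecast[, 1]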

demand_forecast = cbind(demand_forecast$Total_Sales, demand_forecast$`Low 80`, demand_forecast$`High 80`, demand_forecast$`Low 95`, demand_forecast$`High 95`, size_proportions)
names(demand_forecast) = c("Point Forecast", "Low 80", "High 80", "Low 95", "High 95", "L","M","S","XL","XXL","3XL","XS","XXS")
demand_forecast = round(demand_forecast)
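Because each size column is rounded independently, the rounded sizes need not sum exactly to the rounded point forecast; a quick sanity check on the per-week discrepancy:

range(rowSums(demand_forecast[, 6:13]) - demand_forecast[, 1])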

start = as.Date("2019-11-11")
dates = c(start)
for (i in 1:51)
{
  dates = c(dates,start+i*7)
}
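The date-building loop above can be replaced with a single call to seq, which generates the 52 weekly start dates directly:

dates = seq(as.Date("2019-11-11"), by = "week", length.out = 52)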
dates = data.frame(dates)
demand_forecast = cbind(dates,demand_forecast)
names(demand_forecast) = c("Week_Start_Date","Point_Forecast", "Low_80", "High_80", "Low_95", "High_95", "L","M","S","XL","XXL","3XL","XS","XXS")

write.csv(demand_forecast,"demand_forecast.csv",row.names = FALSE)
write.table(demand_forecast, "demand_forecast.txt", append = FALSE, sep = "\t", dec = ".",row.names = FALSE, col.names = TRUE)
demand_forecast
##    Week_Start_Date Point_Forecast Low_80 High_80 Low_95 High_95    L    M
## 1       2019-11-11          11636  11151   12122  10894   12379 3356 3889
## 2       2019-11-18           4380   3864    4896   3591    5168 1039 1481
## 3       2019-11-25           5774   5226    6321   4936    6611 1641 2113
## 4       2019-12-02           7007   6428    7585   6122    7891 2416 2200
## 5       2019-12-09           9716   9109   10324   8788   10645 3278 3334
## 6       2019-12-16           6560   5925    7195   5590    7531 1804 2352
## 7       2019-12-23           3760   3099    4421   2749    4771  996 1311
## 8       2019-12-30           2471   1785    3158   1421    3521  668  997
## 9       2020-01-06           2257   1546    2968   1169    3344  658  700
## 10      2020-01-13           2135   1400    2869   1011    3258  575  791
## 11      2020-01-20           2348   1590    3105   1189    3506  658  944
## 12      2020-01-27           1926   1147    2706    734    3119  579  601
## 13      2020-02-03           3254   2452    4055   2028    4480 1113 1096
## 14      2020-02-10           3754   2932    4577   2496    5012 1061 1141
## 15      2020-02-17           2988   2145    3831   1699    4278  871  983
## 16      2020-02-24           2847   1984    3710   1527    4167  867  879
## 17      2020-03-02           3182   2299    4065   1832    4532 1046 1009
## 18      2020-03-09           2727   1825    3629   1347    4106  898  901
## 19      2020-03-16           2667   1746    3588   1259    4075  779  978
## 20      2020-03-23           2283   1344    3222    846    3719  620  710
## 21      2020-03-30           2385   1428    3343    921    3849  769  806
## 22      2020-04-06           2607   1632    3582   1116    4098  768  921
## 23      2020-04-13           2836   1844    3828   1318    4354 1052  928
## 24      2020-04-20           2668   1659    3678   1124    4212  975  912
## 25      2020-04-27           2791   1765    3818   1222    4361  780  943
## 26      2020-05-04           3023   1980    4066   1428    4618 1034 1132
## 27      2020-05-11           3676   2617    4735   2056    5296 1160 1162
## 28      2020-05-18           3994   2919    5069   2350    5638 1086 1314
## 29      2020-05-25           2435   1344    3526    766    4103  789  856
## 30      2020-06-01           3134   2028    4241   1442    4827 1116  998
## 31      2020-06-08           3375   2253    4497   1659    5091  927 1149
## 32      2020-06-15           4040   2903    5177   2301    5779 1173 1258
## 33      2020-06-22           5043   3891    6195   3281    6805 1430 1879
## 34      2020-06-29           3227   2060    4394   1443    5011  900 1157
## 35      2020-07-06           3301   2120    4482   1494    5108  921 1206
## 36      2020-07-13           3994   2798    5190   2165    5823 1459 1457
## 37      2020-07-20           4695   3485    5905   2845    6546 1187 1531
## 38      2020-07-27           4800   3576    6024   2928    6672 1431 1682
## 39      2020-08-03           4454   3216    5692   2561    6348 1304 1357
## 40      2020-08-10           4216   2964    5467   2301    6130 1331 1538
## 41      2020-08-17           4405   3139    5670   2470    6340 1316 1360
## 42      2020-08-24           4523   3245    5802   2568    6479 1321 1466
## 43      2020-08-31           2344   1052    3637    368    4320  738  843
## 44      2020-09-07           3458   2153    4764   1462    5454 1044 1054
## 45      2020-09-14           4746   3428    6065   2730    6762 1434 1417
## 46      2020-09-21           4395   3064    5726   2359    6431 1252 1354
## 47      2020-09-28           5110   3766    6454   3054    7165 1349 1643
## 48      2020-10-05           4567   3211    5924   2492    6642 1417 1380
## 49      2020-10-12           3975   2606    5344   1881    6069 1244 1339
## 50      2020-10-19           3953   2572    5335   1840    6066 1103 1368
## 51      2020-10-26           4163   2769    5557   2031    6295 1208 1406
## 52      2020-11-02           3563   2157    4970   1413    5714 1245 1272
##       S   XL  XXL 3XL XS XXS
## 1  2178 1096 1096   7 14   0
## 2   907  466  466   6 13   0
## 3  1082  458  458   7 14   0
## 4  1592  388  388   7 15   0
## 5  1744  670  670   7 14   0
## 6  1568  411  411   5 10   0
## 7   962  238  238   5 10   0
## 8   511  136  136   7 15   0
## 9   538  172  172   5 11   0
## 10  503  125  125   5 10   0
## 11  366  182  182   5 11   0
## 12  449  141  141   5 10   0
## 13  603  211  211   6 13   0
## 14  772  380  380   6 13   0
## 15  625  244  244   6 13   0
## 16  610  237  237   6 12   0
## 17  408  350  350   6 12   0
## 18  423  244  244   6 11   0
## 19  540  175  175   7 14   0
## 20  424  255  255   6 11   0
## 21  430  179  179   7 15   0
## 22  490  202  202   8 16   0
## 23  473  178  178   8 17   1
## 24  450  153  153   8 17   0
## 25  611  214  214   9 19   1
## 26  417  203  203  11 22   1
## 27  652  337  337   9 18   1
## 28  832  361  361  13 26   1
## 29  492  127  127  14 30   1
## 30  608  182  182  16 33   1
## 31  633  313  313  13 27   1
## 32  689  435  435  16 34   1
## 33  873  401  401  19 39   1
## 34  660  233  233  14 29   1
## 35  535  298  298  14 29   1
## 36  622  201  201  18 36   1
## 37  972  477  477  17 34   1
## 38 1018  305  305  19 39   1
## 39  912  418  418  14 30   1
## 40  771  268  268  13 26   1
## 41  830  434  434  10 20   1
## 42 1099  304  304  10 20   1
## 43  469  134  134   8 17   0
## 44  635  348  348   9 19   1
## 45 1303  284  284   8 16   0
## 46  977  395  395   7 14   0
## 47 1124  488  488   6 12   0
## 48  886  431  431   7 15   0
## 49  677  349  349   6 12   0
## 50  805  331  331   5 11   0
## 51  926  302  302   6 12   0
## 52  552  236  236   7 14   0